
Conversation

@kakra (Owner) commented Nov 23, 2024

Export patch series: https://github.com/kakra/linux/pull/36.patch

Here's a good guide by @Forza-tng: https://wiki.tnonline.net/w/Btrfs/Allocator_Hints. Please leave them a nice comment. Thanks. :-)

  • Allocator hint patches: Allow preferring SSDs for meta-data allocation while excluding HDDs from it, which greatly improves btrfs responsiveness. The file system remains compatible with non-patched systems but won't honor the allocation preferences there (a re-balance is needed to fix that after going back to a patched kernel).
  • RAID1 read balance patches: Allow round-robin distribution of RAID1 read requests across multiple disks, or preferring the device with the lowest latency. In some non-scientific tests, this can easily cut game loading times in half.

Allocator hints

To make use of the allocator hints, add these patches to your kernel. Then run btrfs device usage /path/to/btrfs and take note of which device IDs are SSDs and which are HDDs.
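
If you are unsure which members are rotational, something like the following can help (the mount point is a placeholder; adapt to your system):

# Device IDs and paths of the btrfs members
sudo btrfs device usage /path/to/btrfs | grep ', ID:'

# ROTA=1 means rotational (HDD), ROTA=0 means SSD/NVMe
lsblk -d -o NAME,ROTA,MODEL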

Go to /sys/fs/btrfs/BTRFS-UUID/devinfo and run:

  • echo 0 | sudo tee HDD-ID/type to prefer writing data to this device (btrfs will then prefer allocating data chunks from this device before considering other devices) - recommended for HDDs, set by default
  • echo 1 | sudo tee SSD-ID/type to prefer writing meta-data to this device (btrfs will then prefer allocating meta-data chunks from this device before considering other devices) - recommended for SSDs
  • There are also types 2 and 3 which write meta-data only (2) or data only (3) to the specified device - not recommended, can result in early no-space situations
  • Added 2024-06-27: Type 4 can be used to avoid allocating new chunks from a device, useful if you plan on removing the device from the pool in the future: echo 4 | sudo tee LEGACY-ID/type
  • Added 2024-12-06: Type 5 can be used to prevent allocating any chunks from a device, useful if you plan on removing multiple devices from the pool in parallel: echo 5 | sudo tee LEGACY-ID/type
  • NEVER EVER use type 2 or 3 if you only have one type of device unless you know what you are doing and why
  • The default "preferred" heuristics (0 and 1) are good enough because btrfs will always allocate from the devices with the most unallocated space first (respecting the "preferred" type with this patch)
  • After changing the values, a one-time meta-data and/or data balance (optionally filtered to the affected device IDs) is needed - see the example below
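
A minimal end-to-end sketch, assuming device IDs 1-3 are HDDs and IDs 4-5 are SSD partitions (UUID, mount point and IDs are placeholders for your own layout):

cd /sys/fs/btrfs/BTRFS-UUID/devinfo
# HDDs: prefer data (type 0 is the default anyway)
echo 0 | sudo tee 1/type 2/type 3/type
# SSD partitions: prefer meta-data
echo 1 | sudo tee 4/type 5/type

# One-time balance so existing chunks migrate to the preferred devices
sudo btrfs balance start -m /path/to/btrfs   # meta-data chunks
sudo btrfs balance start -d /path/to/btrfs   # optionally data too; filters like -ddevid=N can limit the scope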

Important note: It is recommended to use at least two independent SSDs so the btrfs meta-data raid1 requirement is still satisfied. You can, however, create two partitions on the same SSD, but then the file system is no longer protected against hardware faults; the meta-data is essentially dup-quality then, not raid1. Before sizing the partitions, look at btrfs device usage to find the current amount of meta-data, and size your meta-data partitions to at least double that.

This can be combined with bcache by using the meta-data partitions directly as native SSD partitions for btrfs, and routing only the data partitions through bcache. This also takes a lot of meta-data pressure off bcache, making it more efficient and less write-wearing as a result.
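
A rough sketch of how such a layout could be assembled (device names are placeholders and this is not the exact history of the setup shown below; consult the bcache documentation before running anything like this):

# One cache partition on the SSD, the meta-data partitions stay untouched
sudo make-bcache -C /dev/sde2
# Register the HDD partitions as backing devices
sudo make-bcache -B /dev/sda2 /dev/sdb2 /dev/sdc2
# Attach each backing device to the cache set (UUID from: bcache-super-show /dev/sde2)
echo <cset-uuid> | sudo tee /sys/block/bcache0/bcache/attach
# Create btrfs over the bcache devices (data) plus the raw SSD partitions (meta-data)
sudo mkfs.btrfs -d single -m raid1 /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/sde4 /dev/sde5
# After mounting, set the allocator hints (type 1) on the two SSD partitions as described above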

Real-world example

In this example, sde is a 1 TB SSD having two meta-data partitions (2x 128 GB) with the remaining space dedicated to a single bcache partition attached to my btrfs pool devices:

# btrfs device usage /
/dev/bcache2, ID: 1
   Device size:             3.63TiB
   Device slack:            3.50KiB
   Data,single:             1.66TiB
   Unallocated:             1.97TiB

/dev/bcache0, ID: 2
   Device size:             3.63TiB
   Device slack:            3.50KiB
   Data,single:             1.66TiB
   Unallocated:             1.97TiB

/dev/bcache1, ID: 3
   Device size:             2.70TiB
   Device slack:            3.50KiB
   Data,single:           752.00GiB
   Unallocated:             1.96TiB

/dev/sde4, ID: 4
   Device size:           128.00GiB
   Device slack:              0.00B
   Metadata,RAID1:         27.00GiB
   System,RAID1:           32.00MiB
   Unallocated:           100.97GiB

/dev/sde5, ID: 5
   Device size:           128.01GiB
   Device slack:              0.00B
   Metadata,RAID1:         27.00GiB
   System,RAID1:           32.00MiB
   Unallocated:           100.98GiB

# bcache show
Name            Type            State                   Bname           AttachToDev
/dev/sdd2       1 (data)        dirty(running)          bcache1         /dev/sde2
/dev/sdb2       1 (data)        dirty(running)          bcache2         /dev/sde2
/dev/sde2       3 (cache)       active                  N/A             N/A
/dev/sdc2       1 (data)        clean(running)          bcache3         /dev/sde2
/dev/sda2       1 (data)        dirty(running)          bcache0         /dev/sde2

A curious reader may notice that sde1 and sde3 are missing: sde1 is my EFI boot partition and sde3 is swap space.

Read Policies aka "RAID1 read balancer"

To use the balancer, CONFIG_BTRFS_EXPERIMENTAL=y is needed while building the kernel. The balancer offers six modes:

  • pid provides the old PID-based balancer and is the default.
  • round-robin provides the best performance if all member disks are of the same type. Performance may be unstable with bcache. Preferred for parallel workloads like servers.
  • latency provides the best performance with asymmetric RAID configurations of mixed device types (e.g. an SSD paired with an HDD). Performance may be unstable with bcache but may increase bcache hit rates over time. Preferred for latency-sensitive workloads like desktops.
  • latency-rr tries to combine round-robin and latency into one hybrid approach by using round-robin across the set of stripes within a 125% margin of the best latency. I am currently testing this and have not yet discovered the benefits or downsides, but in theory it should prefer the fastest stripes for small requests while switching over to using all stripes for large continuous requests. With just two mirrors this technically behaves as either latency or round-robin; it should work better with more mirrors.
  • queue diverts each request to the stripe with the shortest IO queue (in-flight requests) and works exceptionally (and unexpectedly) well with all workloads I tested: it massively outperforms all other policies in each benchmark scenario and in virtualization workloads.
  • devid prefers reading from a specified device ID. Similar to latency, it provides the best performance if you know one drive is faster than another. This could stabilize performance with bcache as it will prefer caching data only from the specified disk. Defaults to the latest disk added to the pool.

Combined with bcache, performance is unstable with both round-robin and latency because latency and throughput depend on whether data is cached or not - but still it is overall better than the old PID balancer.

Unexpectedly, queue performs exceptionally well on my mixed device setup. YMMV with identical member devices. It outperforms all other policies in each discipline and benchmark.

To use the balancer, add btrfs.read_policy=<pid,round-robin,latency,latency-rr,queue,devid:#> to the kernel cmdline. There's also a sysfs interface at /sys/fs/btrfs/<UUID>/read_policy to switch balancers on demand (e.g., for benchmarks). See modinfo btrfs for more information.
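
For example, to inspect and switch the policy at runtime (UUID is a placeholder; the devid:<id> form follows the cmdline syntax above):

# Show the currently active policy
cat /sys/fs/btrfs/BTRFS-UUID/read_policy

# Switch the policy on the fly
echo queue | sudo tee /sys/fs/btrfs/BTRFS-UUID/read_policy
echo devid:4 | sudo tee /sys/fs/btrfs/BTRFS-UUID/read_policy

# Or set it persistently on the kernel command line, e.g.:
#   btrfs.read_policy=queue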

Benchmark results: https://gist.github.com/kakra/ce99896e5915f9b26d13c5637f56ff37

Note 1: The latency calculation currently uses an average over the full history of requests only - which is bad because it cancels out variations over time. A better approach would be an EMA (exponential moving average) with an alpha of 1/8 or 1/16. This requires sampling individual bio latency and thus changing structs and code in other parts of btrfs. I'm not very familiar with all the internal structures yet, and the feature is still guarded by CONFIG_BTRFS_EXPERIMENTAL, making that approach more complex. OTOH, having a permanent EMA right in the bio structures of btrfs could prove useful in other areas of btrfs.

Note 2: In theory, both latency modes should automatically prefer faster zones of HDDs and switch stripes accordingly. In practice, this is probably overruled by note 1 unless most of your data happens to sit in specific zones by coincidence, in which case the average would capture some sort of "zone performance".

Note 3: With high CPU core counts, queue might have a measurable CPU overhead due to the queue length calculation (per-core counters have to be summed for each request).

Real-world example

Some simple tests have shown that both the round-robin and latency balancers can increase throughput while loading games from 200-300 MB/s to 500-700 MB/s when combined with bcache.

Important note: This will be officially available with kernel 6.15 or later, excluding the latency balancer. I've included it because I think it can provide better performance in edge cases, e.g. asymmetric RAID or bcache. It may also provide better performance on the desktop, where latency is more important than throughput. The latency balancer is thus an experiment and may go away, but I will keep it until at least the next LTS kernel.


Description / instructions for balancing

(AI generated after training with some stats, observations and incremental development steps)

Interpreting the Btrfs read_stats Sysfs Output

The /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/read_stats file, enhanced by these patches, offers valuable insights into the dynamic read balancing behavior and performance of individual devices within a Btrfs RAID1/10/1C3/4 setup. Here's a breakdown of the fields:

cumulative ios %lu wait %llu avg %llu checkpoint ios %ld wait %lld avg %llu age %lld count %llu ignored %lld
  • cumulative ios %lu: The total count of read I/O operations completed on this specific device since the filesystem was mounted.

  • cumulative wait %llu: The total time (in nanoseconds) accumulated waiting for all cumulative read IOs on this device.

  • cumulative avg %llu: The long-term average read latency (cumulative wait / cumulative ios) in nanoseconds. This represents the device's average performance over its entire operational history within the current mount. It changes very slowly and can be heavily influenced by caching layers (like bcache or the page cache) if present.

  • checkpoint ios %ld: The number of read IOs completed since the last checkpoint. A checkpoint is established when a device undergoes "rehabilitation" – meaning its age counter reached the BTRFS_DEVICE_LATENCY_CHECKPOINT_AGE threshold, triggering a read probe and a reset of these checkpoint statistics. For devices that have never been rehabilitated, this value will equal cumulative ios.

  • checkpoint wait %lld: The total time (in nanoseconds) accumulated waiting for reads since the last checkpoint.

  • checkpoint avg %llu: The average read latency (checkpoint wait / checkpoint ios) calculated only using the IOs since the last checkpoint. This is a key metric reflecting recent performance. It's much more responsive to current conditions than the cumulative average, especially after a period of being ignored.

  • age %lld: This counter tracks how "stale" the device is in terms of read selection. It increments each time a read balancing decision is made for a stripe group containing this device, but another device from that group is chosen.

    • 0: The device was selected for a read very recently (in the last relevant balancing decision).
    • > 0: The device has been ignored for this many consecutive selection events where it was a candidate. A high value indicates it's consistently considered slower or less preferred than its peers.
    • < 0: The device has just been rehabilitated (hit the age threshold). It is now in a "burst IO" probation period (e.g., starting at -100 and incrementing towards 0). During this negative age phase, its reported latency is forced to 0 to guarantee it receives reads.
  • count %llu: The number of times this device has triggered the rehabilitation mechanism by reaching the age threshold. A high count suggests the device is frequently deemed too slow by the latency policy or is subject to other selection biases (like non-balancing metadata reads).

  • ignored %lld: A counter incremented every time this device was a candidate for a read, but the balancing policy ultimately selected a different device from the same stripe group. This provides insight into how often the policy actively chooses a peer over this device, indicating relative preference or "fairness" of the algorithm.

How to Use These Stats:

  • Identify Slow Devices: Devices consistently showing a high checkpoint avg (compared to peers) and a high, frequently reset age are likely performance bottlenecks under the current load.
  • Assess Policy Effectiveness: Compare cumulative avg and checkpoint avg. A large difference after rehabilitation (count > 0) shows the policy is adapting to performance changes more quickly than the cumulative average would suggest.
  • Detect Selection Bias: A device with a good checkpoint avg but a persistently high age and ignored count (like NVMe metadata mirrors sometimes exhibit) points towards a selection bias not based purely on latency.
  • Tune Rehabilitation: The age and count values help evaluate the AGE_THRESHOLD and IO_BURST parameters. If age hits the threshold very frequently, it might be too low. If checkpoint ios barely increases after a reset, the IO_BURST might be too short (or the device becomes slow again immediately).

These enhanced statistics provide a powerful diagnostic tool for understanding and fine-tuning Btrfs's read balancing behavior in complex, real-world storage environments.
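
For a quick look at the raw numbers, something like this can be used (UUID is a placeholder):

# One read_stats line per member device
grep . /sys/fs/btrfs/BTRFS-UUID/devinfo/*/read_stats

# Refresh every second while running a read-heavy workload
watch -n1 'grep . /sys/fs/btrfs/BTRFS-UUID/devinfo/*/read_stats'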

kreijack and others added 5 commits November 18, 2024 15:26
Add the following flags to give a hint about which chunk should be
allocated on which disk.
The following flags are created:

- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
  preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
  preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
  only data chunk allowed

Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
When this mode is enabled, the chunk allocation policy is modified as
follows.

Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This means also that it is a preferred choice.

Each time the allocator allocates a chunk of type X, it first takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the
space is not enough, it also uses the disks tagged as
ALLOCATION_METADATA_ONLY; if the space is still not enough, it also uses
the other disks, with the exception of the ones marked as
ALLOCATION_PREFERRED_Y, where Y is the other type of chunk (i.e. not X).

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
This is useful where you want to prevent new allocations of chunks on a
disk which is going to be removed from the pool anyway, e.g. due to bad
blocks or because it's slow.

Signed-off-by: Kai Krakow <kai@kaishome.de>
This is useful where you want to prevent new allocations of chunks to
a set of multiple disks which are going to be removed from the pool.
It acts like `btrfs dev remove` on steroids: multiple disks can be
removed in parallel without moving data to disks which would be removed
in the next round. In such cases, it avoids moving the same data
multiple times, and thus avoids placing it on potentially bad disks.

Thanks to @Zygo for the explanation and suggestion.

Link: kdave/btrfs-progs#907 (comment)
Signed-off-by: Kai Krakow <kai@kaishome.de>
@tanriol commented Feb 6, 2025

Hi. What's the status of these patches? Are they something that's going to land upstream in a reasonable amount of time, or a long-term external patch series?

@kakra (Owner) commented Feb 7, 2025

These won't go into the kernel as-is and may be replaced by some different implementation in the kernel sooner or later. But I keep them safe to use - i.e. they don't create incompatibilities with future kernels and can simply be dropped from your kernel without posing any danger to your btrfs.

@Forza-tng has some explanations why those patches won't go into the kernel: https://wiki.tnonline.net/w/Btrfs/Allocator_Hints

@Forza-tng commented:

Hi.

I noticed that type 0 (data preferred) has a higher priority than type 3 (data only). This can lead to interesting cases. For example, on one server I had replaced two disks but forgot to set type 3, so they were left as type 0 while the other disks were type 3.

The result for this RAID10 was that new data were stored only on the new disks (ID 15 and 16) with a 2 stripe RAID10 instead of the expected 10 stripe RAID10.
(attached screenshot of the device usage omitted)

While this was unintended, perhaps this effect could be used to tier data chunks? To test this, I created a 3 device btrfs:

1 = ssd
2 = hdd
3 = nvme

❯ grep . /sys/fs/btrfs/238f21dc-8199-4eaa-b503-ecd2983456d6/devinfo/*/type
1/type:0x00000000 # data preferred
2/type:0x00000003 # data only
3/type:0x00000002 # metadata only

❯ btrfs fi us -T .
Overall:
    Device size:		 115.00GiB
    Device allocated:		   6.03GiB
    Device unallocated:		 108.97GiB
    Device missing:		     0.00B
    Device slack:		     0.00B
    Used:			   3.00GiB
    Free (estimated):		 110.97GiB	(min: 110.97GiB)
    Free (statfs, df):		 110.97GiB
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		   5.50MiB	(used: 16.00KiB)
    Multiple profiles:		        no

              Data    Metadata System                              
Id Path       single  single   single   Unallocated Total     Slack
-- ---------- ------- -------- -------- ----------- --------- -----
 1 /dev/loop0 5.00GiB        -        -     5.00GiB  10.00GiB     -
 2 /dev/loop1       -        -        -   100.00GiB 100.00GiB     -
 3 /dev/loop2       -  1.00GiB 32.00MiB     3.97GiB   5.00GiB     -
-- ---------- ------- -------- -------- ----------- --------- -----
   Total      5.00GiB  1.00GiB 32.00MiB   108.97GiB 115.00GiB 0.00B
   Used       3.00GiB  3.17MiB 16.00KiB                            



❯ dd if=/dev/zero of=file6.data count=10000000

❯ btrfs fi us -T .
Overall:
    Device size:                 115.00GiB
    Device allocated:             10.03GiB
    Device unallocated:          104.97GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                          8.78GiB
    Free (estimated):            105.20GiB      (min: 105.20GiB)
    Free (statfs, df):           105.20GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:                7.67MiB      (used: 0.00B)
    Multiple profiles:                  no
              Data    Metadata System
Id Path       single  single   single   Unallocated Total     Slack
-- ---------- ------- -------- -------- ----------- --------- -----
 1 /dev/loop0 9.00GiB        -        -     1.00GiB  10.00GiB     -
 2 /dev/loop1       -        -        -   100.00GiB 100.00GiB     -
 3 /dev/loop2       -  1.00GiB 32.00MiB     3.97GiB   5.00GiB     -
-- ---------- ------- -------- -------- ----------- --------- -----
   Total      9.00GiB  1.00GiB 32.00MiB   104.97GiB 115.00GiB 0.00B
   Used       8.77GiB  9.08MiB 16.00KiB
   
   
❯ dd if=/dev/zero of=file7.data count=10000000

❯ btrfs fi us -T .
Overall:
    Device size:                 115.00GiB
    Device allocated:             15.03GiB
    Device unallocated:           99.97GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         13.03GiB
    Free (estimated):            100.95GiB      (min: 100.95GiB)
    Free (statfs, df):           100.95GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:               12.81MiB      (used: 0.00B)
    Multiple profiles:                  no
              Data     Metadata System
Id Path       single   single   single   Unallocated Total     Slack
-- ---------- -------- -------- -------- ----------- --------- -----
 1 /dev/loop0 10.00GiB        -        -     1.00MiB  10.00GiB     -
 2 /dev/loop1  4.00GiB        -        -    96.00GiB 100.00GiB     -
 3 /dev/loop2        -  1.00GiB 32.00MiB     3.97GiB   5.00GiB     -
-- ---------- -------- -------- -------- ----------- --------- -----
   Total      14.00GiB  1.00GiB 32.00MiB    99.97GiB 115.00GiB 0.00B
   Used       13.02GiB 13.41MiB 16.00KiB

We can see that the smaller ssd fills up before it spills over onto the HDD.

OK, perhaps not very useful since we cannot easily move hot data back onto the ssd. However, it could suffice as an emergency overflow to avoid ENOSPC if your workload can survive the reduced iops... :)

@kakra (Owner) commented Feb 14, 2025

Yes, I think this is intentional behavior of the initial version of the patches: the type numbers are generally used as a priority sort, with the *-only types acting as a kind of exception. My added new types follow a similar exception rule.

I'm not sure if it would be useful to put data on data-only disks first.

I think my idea of using chunk size classes for tiering may be more useful than this side-effect (what I mentioned in a report over at btrfs-todo).

But in theory, type 0 and type 3 should be treated equally as soon as the remaining unallocated space is identical... Did you reach that point? (looks like your loop dev example did exactly that if I followed correctly)

But in the end: Well, "preferred" means "preferred", doesn't it? ;-)

@Forza-tng commented:

But in theory, type 0 and type 3 should be treated equally as soon as the remaining unallocated space is identical... Did you reach that point? (looks like your loop dev example did exactly that if I followed correctly)

The loop test did the opposite of this: the type 0 device was filled before the type 3 one, even though it was smaller and had less unallocated space. I had expected types 3 and 0 to be treated equally, but we see that this isn't the case?

It isn't wrong or bad, just something I hadn't thought would happen.

But in the end: Well, "preferred" means "preferred", doesn't it? ;-)

Indeed 😁

asj added 3 commits April 7, 2025 10:47
Refactor the logic in btrfs_read_policy_show() to streamline the
formatting of read policies output. Streamline the space and bracket
handling around the active policy without altering the functional output.
This is in preparation to add more methods.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Currently, fs_devices->fs_info is initialized in btrfs_init_devices_late(),
but this occurs too late for find_live_mirror(), which is invoked by
load_super_root() much earlier than btrfs_init_devices_late().

Fix this by moving the initialization to open_ctree(), before load_super_root().

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
…store

Introduce the `btrfs_read_policy_to_enum` helper function to simplify the
conversion of a string read policy to its corresponding enum value. This
reduces duplication and improves code clarity in `btrfs_read_policy_store`.
The `btrfs_read_policy_store` function has been refactored to use the new
helper.

The parameter is copied locally to allow modification, enabling the
separation of the method and its value. This prepares for the addition of
more functionality in subsequent patches.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
@kakra (Owner) commented Apr 7, 2025

Added RAID1 read balance patches, see PR description.

@kakra (Owner) commented Apr 7, 2025

@Forza-tng Looking forward to some benchmark numbers if you want to do them. :-)

@Zygo commented Apr 7, 2025

I think this is intentional behavior of the initial version of the patches: The type numbers are generally used as a priority sort with the *-only doing some kind of exception. My added new types follow a similar exception rule.

I made a different suggestion when those first came out. The current approach with arbitrary, use-case-based names ("data only", "data preferred", "metadata only", "metadata preferred", and "none") is going to be very confusing, especially with the implied sorting they have to do.

I suggest that the hints should be a bitmask of the kinds of allocation that would be allowed on the device. When allocation fails with the first preference, the order in which new drives are added to the free space search should be specified separately. Splitting it into two parts gives us clean options for "what is allowed" and "when is it allowed".

What is allowed

  • no data, no metadata - i.e. "none", useful for removing multiple drives in parallel
  • data, no metadata - we currently call this "data only"
  • no data, metadata - we currently call this "metadata only"
  • data, metadata - the default, identical to unpatched btrfs, called "data preferred" (IMHO that is a very confusing name for what it does)

It's much clearer what to expect when the options are expressed this way: you get metadata on a device, or you don't. There's no possibility of data spilling onto a device that you didn't ask for.

When it is allowed

To get the "preferred" options, there must be multiple allocation passes, one for each preference level. We need to specify priority for each device within each pass for each allocation type. So we expand the above to more than one bit. e.g. with two bits for each type, you have 4 levels for each:

bits  preference name  allocation passes
00    always           used in all passes
01    tier2            used in second and subsequent passes
10    tier3            used in third pass and beyond
11    never            never used

The allocator would then run multiple passes, with each pass adding more drives from the next preference level to search for free space. This loop would stop one level before the lowest, so any device at the lowest preference level would never be used, giving us the "no data" or "no metadata" cases.

Implementation notes

  • If the allocation succeeds, all further passes are skipped.
  • An optimized version also skips passes when there would be no different result, e.g. if there are no devices with preference tier2 or tier3, then we don't need to run subsequent passes. This can be precomputed when the preferences are changed or when a device is added or removed.
  • Currently all existing filesystems have zeros in the bytes where we want to store this field. The bit encoding should make all-zero-bits correspond to the default behavior, and the bit encoding should be sortable. That means "always" would be encoded as 00, while "never" would be encoded as 11, "tier2" is 01, "tier3" is 10.
  • Userspace tools can assist with preset preference values for common use cases. e.g. mkfs.btrfs might notice a large rotating disk and a small SSD, and set up preferred metadata on SSD with overflow in both directions.
  • The kernel should allow all possible combinations of options. Userspace could disallow settings which are usually undesirable (e.g. "none for all devices"), as long as it remains possible for a user who knows what they're doing to plug numeric values into the sysfs interface directly.

Use cases

An array of SSDs and HDDs split into metadata and data respectively

Device  Type  Data preference  Metadata preference
1       HDD   always (00)      never (11)
2       HDD   always (00)      never (11)
3       NVMe  never (11)       always (00)
4       NVMe  never (11)       always (00)

A pair of SSD and HDD (as you might find in a laptop) with bidirectional overflow

Device  Type  Data preference  Metadata preference
1       HDD   always (00)      tier2 (01)
2       NVMe  tier2 (01)       always (00)

A pair of SSD and HDD which allows metadata to overflow to HDD, but no data to overflow to SSD

Device  Type  Data preference  Metadata preference
1       HDD   always (00)      tier2 (01)
2       NVMe  never (11)       always (00)

A multi-device remove with no overflow allowed to removed devices

Device  Type       Data preference  Metadata preference  Notes
1       HDD        always (00)      always (00)          Normal usage device
2       HDD        always (00)      always (00)          Normal usage device
3       To-remove  never (11)       never (11)           Being removed
4       To-remove  never (11)       never (11)           Being removed

This config prevents any new data or metadata from being allocated on the to-remove devices even if that would result in ENOSPC, e.g. if the devices are being removed because they are failing.

A multi-device remove with overflow allowed to removed devices

Device  Type       Data preference  Metadata preference  Notes
1       HDD        always (00)      always (00)          Normal usage device
2       HDD        always (00)      always (00)          Normal usage device
3       To-remove  tier3 (10)       tier3 (10)           Used only if necessary
4       To-remove  tier3 (10)       tier3 (10)           Used only if necessary

This config allows allocations to fall back to to-be-removed devices if other devices run out of space, making it a deliberate safety valve to prevent ENOSPC in case the user failed to estimate space correctly.

Multiple tiers

Device  Type  Data preference  Metadata preference  Notes
1       HDD   always (00)      tier3 (10)           Slowest
2       SSD   tier2 (01)       tier2 (01)           Medium
3       NVMe  tier3 (10)       always (00)          Fastest

If the slowest drive is the largest one, the default allocator behavior will try to put all the metadata on the slowest drive. This config flips the allocation order in that scenario.

Future considerations

Why do we have more than one level between "always" and "never"? We only need 3 levels to support the proposal for "preferred", "only", and "none". 3 levels require at least 2 bits, but 2 bits give 4 values, so we get a second middle level for free.

We can lean into that: If the user has some complex multi-level tiered storage, or a mashup of old drives with various performance and reliability, or they're doing a complicated reshape, then that second middle level -- or a third bit to extend to a total of 6 levels + always + never -- could be useful.

If we expanded to 8 bits, we'd be able to provide drive-by-drive customized allocation order.

I can't think of a use case for this off the top of my head, but on the other hand, I didn't know about the "none" use case until I found myself in immediate need of it. Maybe someone else will run into it, or it will become part of a more general tiered storage solution.

@kakra (Owner) commented Apr 7, 2025

@Zygo I like this bitfield suggestion, and it should still be easy to use for simple ordering: Just mask out the bits not used for a request, then compare the remaining. And as you already pointed out, we should rather use 3 bits for tiering: There may be much more complex scenarios of different drives (5400, 7200, 10k rpm) or maybe even network storage (iSCSI, DRBD) involved which have very different performance characteristics.

So we could have RDDDRMMM (reserved, data, metadata) in an 8-bit-field.

I'd still need to figure out why we need multiple passes or how that is different from the current implementation, and I probably need some time for it. I think your multipass idea differs in how free space is considered. We should also look into how we could properly migrate the old settings to the new ones automatically.

And instead of writing raw decimal numbers to the type field, we could probably just expose additional sysfs files like meta_tier and data_tier...

@Zygo commented Apr 7, 2025

I'd still need to figure out why we need multiple passes or how that is different from the current implementation

Currently we have two semantic preference levels: one for "only", then another for "preferred". IIRC there's no explicit nested loop in the current code--we're just doing a sort by size and preference level, then we loop once through the sorted list of drives, and we cut off the search at two different points to get the preference levels. So my proposed change is to make the outer loop explicit, and run it for as many iterations as there are preference levels in use (or do it in a single loop, but order it properly so it has the same effect as a nested loop).

Making the outer loop explicit might also help clean up some weirdness that currently happens with out-of-size-order preferences and striped profile (e.g. raid5 or raid10) allocations when the device sizes don't line up with preferences the right way, e.g. you can get narrow raid5 stripes if some devices are "preferred" and some "only", because there's no way to separate metadata order from data order in the current implementation--metadata order is strictly the opposite of data order, and that's not always what we want.

The preference data type doesn't have to be bitfields, either in storage or interface. Ordinary integers (one for data, one for metadata, for each drive) will work fine. Thinking of it as generalizations layered on a single-bit yes/no preference concept may be helpful...or it may not.

Currently the patches store everything in a single integer field. One commenter the first time around suggested moving the whole thing into the filesystem tree, e.g. BTRFS_PERSISTENT_ITEM_KEY with an objectid dedicated to allocation preferences, and the offset of each key identifying the device the preferences apply to. That would allow for versioning of the parameter schema and indefinite extension of the fields (e.g. to add a migration policy).

Currently we're using the dev type field in the device superblocks because it's simpler, not because we need to bootstrap allocation directly from superblocks. btrfs has to load the trees before it can allocate anything, so the trees will be available in time to retrieve the allocation preferences.

@kakra (Owner) commented Apr 7, 2025

@Zygo Thanks, that helped me get the idea...

@kakra (Owner) commented Apr 9, 2025

First conclusion using latency vs round-robin:

My system uses bcache (mdraid NVMe) backed by four 7200rpm HDDs. Turns out that latency is at an advantage here. I found that the last drive in the setup is never used for reads in latency mode. Overall, latency mode gives a slightly higher throughput in game loading screens (probably due to bcache, not because I apparently use fewer disks for reading).

Investigating the behavior, I found that this last drive only claims to be 7200rpm when it is 5400rpm in reality. fio clearly shows results typical for 5400 rpm drives:

fio --rw=randread --name=IOPS-read --bs=4k --direct=1 --filename=/dev/DEV --numjobs=1 --ioengine=libaio --iodepth=1 --refill_buffers --group_reporting --runtime=60 --time_based

Tested:

Model                         Device                   clat 20th percentile  IOPS QD1 4k
Western Digital Black         WDC WD4005FZBX-00K5WB0   8586 µs               84
Seagate IronWolf Pro          ST4000NE001-2MA101       8586 µs               83
Seagate IronWolf Pro          ST4000NE001-2MA101       8717 µs               83
Hitachi/HGST Deskstar 7K4000  Hitachi HDS724040ALE640  11076 µs              63

Due to the latency balancer excluding the last disk from most read operations, bcache can be used better because that last disk will be avoided for bcache read caching.

So in a scenario with bcache and/or varying disk types, latency is a clear winner. Without bcache, it would still provide better latency. But in a scenario where throughput matters, you should probably be using round-robin.

I wonder if we can make a hybrid balancer which uses round-robin but weighted/grouped by latency... Because the latency balancer will clearly fail in scenarios where disk latency is only slightly off between each member: It would then prefer to read from fewer devices than it should.

@kakra (Owner) commented Apr 9, 2025

Added a new read balancer latency-rr:

It tries to combine round-robin and latency into one hybrid approach by using round-robin across a set of stripes within a 120% margin of the best latency. I am currently testing this and have not yet discovered the benefits or downsides but in theory it should prefer the fastest stripes for small requests while it switches over to using all stripes for large continuous requests.

Note: The latency calculation currently uses an average of the full history of requests only - which is bad because it will cancel out changing variations over time. A better approach would be to use an EMA (exponential moving average) with an alpha of 1/8 or 1/16. This requires to sample individual bio latency and thus requires to change structs and code in other parts of btrfs. I'm not very familiar with all the internal structures yet, and the feature is still guarded by CONFIG_BTRFS_EXPERIMENTAL, making that approach more complex. OTOH, having a permanent EMA right in the bio structures of btrfs could prove useful in other areas of btrfs.

This is also why I won't try to eliminate the code duplication yet (to avoid double calculations).

asj and others added 2 commits April 10, 2025 04:52
Add fs_devices::read_cnt_blocks to track read blocks, initialize it in
open_fs_devices() and clean it up in close_fs_devices().
btrfs_submit_dev_bio() increments it for reads when stats tracking is
enabled. Stats tracking is disabled by default and is enabled through
fs_devices::fs_stats when required.

The code is not under the EXPERIMENTAL define, as stats can be expanded
to include write counts and other performance counters, with the user
interface independent of its internal use.

This is an in-memory-only feature, different to the dev error stats.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
CONFIG_BTRFS_EXPERIMENTAL is needed by the RAID1 balancing patches but
we don't want to use the full scope of the 6.13 patch because it also
affects features currently masked via CONFIG_BTRFS_DEBUG.

TODO: Drop during rebase to 6.13 or later.
Original-author: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kai Krakow <kai@kaishome.de>
@kakra force-pushed the rebase-6.12/btrfs-patches branch from 050cdab to 2df0dd2 (April 10, 2025 03:38)
@kakra (Owner) commented Apr 10, 2025

Rebased to the newer read policy patchset; my initial merge used an old version from Jan '25.

Important: The module parameter raid1_balancing has been renamed to read_policy.

@kakra force-pushed the rebase-6.12/btrfs-patches branch 2 times, most recently from f1d7497 to e51ca31 (April 10, 2025 11:50)
@GalaxySnail commented:

FWIW, someone mentioned that there are some patches on the linux-btrfs mailing list: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles.

@richardm1 commented:

I was surprised we couldn't reach around 500MiB/s (2x single device) read speeds

Yep, that doesn't work on btrfs and that's why I see no point in using raid10: It just makes head movement more pronounced and thus lowers throughput and increases latency, but btrfs doesn't read stripes in parallel. OTOH, I'm not sure if mdraid10 would be better here.

I think this needs some tuning of the round-robin size to match the read size of your sequential readers and enough queue depth, so it alternates one and the other mirror/stripe each request.

I've spent far too many hours fooling around with this exact issue on OpenZFS. With a small group of mirrors I always hit a wall somewhere around 1.6-1.7x the sum total of the individual HDD throughputs. Unlike a true/classic RAID-0 where one just reads all drives at max speed and "zippers" the data together, long seq. reads from mirrors require data hopscotch. There's unpredictable interactions with HDD internal read-ahead, how much data sits on a given cylinder, rotational latency, "blown" revolutions because you didn't quite seek to the new cylinder in time to grab the next LBA, etc.

I did discover that how the data is written and distributed across spindles matters when reading it back. If there are any knobs to influence the write allocator "pseudo stripe size" (for lack of a better term) it might be worth twisting them.

I love this kind of stuff and wish I had a btrfs mirror to join the fun. Hopefully soon...

@kakra (Owner) commented Jun 5, 2025

I love this kind of stuff and wish I had a btrfs mirror to join the fun. Hopefully soon...

You're welcome :-)

> kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> kernel: rcu:         10-....: (2100 ticks this GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187 fqs=1052
> kernel: rcu:         (t=2100 jiffies g=358306033 q=2241752 ncpus=16)
> kernel: CPU: 10 UID: 0 PID: 1524681 Comm: map_0x178e45670 Not tainted 6.12.21-gentoo #1
> kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d 49 08 44 8b 75 0c 48
> kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> kernel: RAX: 0000000000001f15 RBX: fffff437ea382200 RCX: fffff437cb891200
> kernel: RDX: 000001922b68df2a RSI: 0000000000000000 RDI: ffffa434c3e66d20
> kernel: RBP: ffffa434c3e66d20 R08: 000001922b68c000 R09: 0000000000000015
> kernel: R10: 6c0000000000000a R11: 0000000009fe7000 R12: 0000000000000f2a
> kernel: R13: 0000000000000001 R14: ffffa43192e6d230 R15: ffffa43160c4c800
> kernel: FS:  000055d07085e6c0(0000) GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007fff204ecfc0 CR3: 0000000121a0b000 CR4: 00000000001506f0
> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> kernel: Call Trace:
> kernel:  <IRQ>
> kernel:  ? rcu_dump_cpu_stacks+0xd3/0x100
> kernel:  ? rcu_sched_clock_irq+0x4ff/0x920
> kernel:  ? update_process_times+0x6c/0xa0
> kernel:  ? tick_nohz_handler+0x82/0x110
> kernel:  ? tick_do_update_jiffies64+0xd0/0xd0
> kernel:  ? __hrtimer_run_queues+0x10b/0x190
> kernel:  ? hrtimer_interrupt+0xf1/0x200
> kernel:  ? __sysvec_apic_timer_interrupt+0x44/0x50
> kernel:  ? sysvec_apic_timer_interrupt+0x60/0x80
> kernel:  </IRQ>
> kernel:  <TASK>
> kernel:  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> kernel:  ? btrfs_get_64+0x65/0x110
> kernel:  find_parent_nodes+0x1b84/0x1dc0
> kernel:  btrfs_find_all_leafs+0x31/0xd0
> kernel:  ? queued_write_lock_slowpath+0x30/0x70
> kernel:  iterate_extent_inodes+0x6f/0x370
> kernel:  ? update_share_count+0x60/0x60
> kernel:  ? extent_from_logical+0x139/0x190
> kernel:  ? release_extent_buffer+0x96/0xb0
> kernel:  iterate_inodes_from_logical+0xaa/0xd0
> kernel:  btrfs_ioctl_logical_to_ino+0xaa/0x150
> kernel:  __x64_sys_ioctl+0x84/0xc0
> kernel:  do_syscall_64+0x47/0x100
> kernel:  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> kernel: RIP: 0033:0x55d07617eaaf
> kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> kernel: RSP: 002b:000055d07085bc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> kernel: RAX: ffffffffffffffda RBX: 000055d0402f8550 RCX: 000055d07617eaaf
> kernel: RDX: 000055d07085bca0 RSI: 00000000c038943b RDI: 0000000000000003
> kernel: RBP: 000055d07085bea0 R08: 00007fee46c84080 R09: 0000000000000000
> kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> kernel: R13: 000055d07085bf80 R14: 000055d07085bf48 R15: 000055d07085c0b0
> kernel:  </TASK>

The RCU stall could be because there's a large number of backrefs for
some extents and we're spending too much time looping over them
without ever yielding the cpu.

Link: https://lore.kernel.org/linux-btrfs/CAMthOuP_AE9OwiTQCrh7CK73xdTZvHsLTB1JU2WBK6cCc05JYg@mail.gmail.com/T/#md2e3504a1885c63531f8eefc70c94cff571b7a72
Signed-off-by: Kai Krakow <kk@netactive.de>
@kakra (Owner) commented Jun 23, 2025

Added a test patch that may fix an RCU stall logged to dmesg during heavy meta data operations (e.g. snapshot cleanup during backups).

@kakra (Owner) commented Jun 28, 2025

Added a test patch that may fix an RCU stall logged to dmesg during heavy meta data operations (e.g. snapshot cleanup during backups).

@Zygo I wonder if this (abf6174) has a meaningful impact on bees... The system that was affected is one that runs bees.

Copilot AI left a comment

Pull Request Overview

This PR adds "btrfs patches" to track experimental features for the btrfs filesystem. The changes implement allocator hints to prefer SSDs for metadata allocation and HDDs for data allocation, plus RAID1 read balancing algorithms to improve performance by distributing read requests across multiple devices.

  • Allocator hint system allowing devices to be tagged with allocation preferences (metadata-only, data-only, etc.)
  • Multiple RAID1 read balancing policies (round-robin, latency-based, queue-based, device-specific)
  • Sysfs interfaces for configuring device allocation hints and read policies at runtime

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

File                              Description
include/uapi/linux/btrfs_tree.h   Defines allocation hint constants for device types
fs/btrfs/volumes.h                Adds experimental read policy enums and device latency tracking fields
fs/btrfs/volumes.c                Implements allocation hint logic and read balancing algorithms
fs/btrfs/sysfs.h                  Declares read policy configuration functions
fs/btrfs/sysfs.c                  Implements sysfs interfaces for read policies and device type configuration
fs/btrfs/super.c                  Adds experimental feature initialization during module load
fs/btrfs/fs.h                     Adds per-cpu counter for tracking read block statistics
fs/btrfs/disk-io.c                Initializes read block counter and sets fs_devices->fs_info
fs/btrfs/bio.c                    Tracks read operations in the statistics counter
fs/btrfs/backref.c                Adds scheduling points to prevent soft lockups
fs/btrfs/Kconfig                  Adds CONFIG_BTRFS_EXPERIMENTAL configuration option


int dev_stats_valid;

/* Counter to record the change of device stats */
/* Counter to record of the change of device stats */
Copilot AI commented Sep 19, 2025:

There's a grammatical error - 'record of the change' should be 'record the change of' (removing 'of').

Suggested change
/* Counter to record of the change of device stats */
/* Counter to record the change of device stats */

ret = btrfs_device_init_dev_stats(device, path);
if (ret)
goto out;
return ret;
Copilot AI commented Sep 19, 2025:

The removal of the 'goto out;' pattern and direct return eliminates cleanup logic. Ensure this doesn't skip any necessary cleanup that was previously handled at the 'out' label.

Suggested change
return ret;
goto out;


#ifdef CONFIG_BTRFS_EXPERIMENTAL
/* Separate value from input in policy:value format. */
if ((value_str = strchr(param, ':'))) {
Copilot AI commented Sep 19, 2025:

[nitpick] Assignment within condition is discouraged. Consider separating the assignment: value_str = strchr(param, ':'); if (value_str) {

Suggested change
if ((value_str = strchr(param, ':'))) {
value_str = strchr(param, ':');
if (value_str) {

@eliv commented Oct 15, 2025

Noticed an oops with these patches when doing echo 1 > devinfo/2/type while the mount was still ongoing. My btrfs is big, so the mount takes 20-30 minutes. After a reboot, waiting until the mount was complete made it work fine.
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 4 UID: 0 PID: 3520 Comm: bash Not tainted 6.12.52-dirty #2
Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018
RIP: 0010:_raw_spin_lock+0x17/0x30
Code: 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 ff 05 e8 c0 d8 5e 31 c0 ba 01 00 00 00 0f b1 17 75 05 c3 cc cc cc cc 89 c6 e9 97 01 00 00 0f 1f 80 00
RSP: 0018:ffffbc13a95837c8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
RBP: 0000000000000008 R08: ffffbc13a9583a07 R09: 0000000000000001
R10: d800000000000000 R11: 0000000000000001 R12: ffff9bee913db000
R13: 0000000000000000 R14: 00000000fffffffb R15: ffff9bee913db000
FS: 00007fd6e270f740(0000) GS:ffff9bfddfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000008d9986004 CR4: 00000000003706f0
Call Trace:

__reserve_bytes+0x70/0x720 [btrfs]
? get_page_from_freelist+0x343/0x1570
btrfs_reserve_metadata_bytes+0x1d/0xd0 [btrfs]
btrfs_use_block_rsv+0x153/0x220 [btrfs]
btrfs_alloc_tree_block+0x83/0x580 [btrfs]
btrfs_force_cow_block+0x129/0x620 [btrfs]
btrfs_cow_block+0xcd/0x230 [btrfs]
btrfs_search_slot+0x566/0xd60 [btrfs]
? kmem_cache_alloc_noprof+0x106/0x2f0
btrfs_update_device+0x91/0x1d0 [btrfs]
btrfs_devinfo_type_store+0xb8/0x140 [btrfs]
kernfs_fop_write_iter+0x14c/0x200
vfs_write+0x289/0x440
ksys_write+0x6d/0xf0
trace_clock_x86_tsc+0x20/0x20
? do_wp_page+0x838/0xf90
? __do_sys_newfstat+0x68/0x70
? __pte_offset_map+0x1b/0xf0
? __handle_mm_fault+0xa6c/0x10f0
? __count_memcg_events+0x53/0xf0
? handle_mm_fault+0x1c4/0x2d0
? do_user_addr_fault+0x334/0x620
? arch_exit_to_user_mode_prepare.isra.0+0x11/0x90
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fd6e27a1687
Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
RSP: 002b:00007ffecb401260 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fd6e270f740 RCX: 00007fd6e27a1687
RDX: 0000000000000002 RSI: 0000557a2c38ad20 RDI: 0000000000000001
RBP: 0000557a2c38ad20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
R13: 00007fd6e28fa5c0 R14: 00007fd6e28f7e80 R15: 0000000000000000

Modules linked in: rpcsec_gss_krb5 nfsv3 nfsv4 dns_resolver nfs netfs zram lz4hc_compress lz4_compress dm_crypt bonding tls ipmi_ssif intel_rapl_msr nfsd binfmt_misc auth_rpcgss nfs_acl lockd grace intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp rapl intel_cstate s>
intel_pmc_bxt ixgbe ehci_pci iTCO_vendor_support xfrm_algo gf128mul libata mpt3sas xhci_hcd ehci_hcd watchdog crypto_simd mdio_devres libphy cryptd raid_class usbcore scsi_transport_sas mdio igb scsi_mod wmi usb_common i2c_i801 lpc_ich scsi_common i2c_smbus i2c_algo_bit dca
CR2: 0000000000000008
---[ end trace 0000000000000000 ]---
RIP: 0010:_raw_spin_lock+0x17/0x30
Code: 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 ff 05 e8 c0 d8 5e 31 c0 ba 01 00 00 00 0f b1 17 75 05 c3 cc cc cc cc 89 c6 e9 97 01 00 00 0f 1f 80 00
RSP: 0018:ffffbc13a95837c8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
RBP: 0000000000000008 R08: ffffbc13a9583a07 R09: 0000000000000001
R10: d800000000000000 R11: 0000000000000001 R12: ffff9bee913db000
R13: 0000000000000000 R14: 00000000fffffffb R15: ffff9bee913db000
FS: 00007fd6e270f740(0000) GS:ffff9bfddfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000008d9986004 CR4: 00000000003706f0

@Forza-tng commented:

@eliv interesting. A workaround could be to switch your filesystem to block group tree. It would make your fs mount in seconds instead of half an hour.

https://btrfs.readthedocs.io/en/latest/btrfstune.html
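
For reference, the conversion is an offline operation with a recent btrfs-progs (a sketch; check the linked documentation for prerequisites before converting):

# Filesystem must be unmounted
sudo btrfstune --convert-to-block-group-tree /dev/sdX

# Verify the feature flag afterwards
sudo btrfs inspect-internal dump-super /dev/sdX | grep -i block_group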

@kakra (Owner) commented Oct 15, 2025

A workaround could be to switch your filesystem to block group tree

@eliv @Forza-tng We should still fix this. But I'm not sure if I want to fix it for 6.12 because there's an easy work-around: just wait until the mount has settled and only then modify the flags. It's not something you'd change on a daily basis anyway.

For the next LTS (most likely 6.18) I will reduce the stripe-selection patches to add only "queue" (the 6.18 kernel brings round-robin and pid by default). According to all the benchmarks, there's absolutely no use case where latency or latency-rr has a real benefit. Then during the following year (2026) I want to improve the quality of the patches (for stripe selection at least; chunk type per device type is a very different thing, and there's ongoing work in the kernel currently preparing easier integration of such patches), add more benchmarks, and then suggest them for upstream.

@Forza-tng commented:

What happens if we write to other btrfs sysfs entries, like bg_reclaim_threshold, before a mount is completed? Is it a general issue with sysfs handling, or just these patches?

@kakra (Owner) commented Oct 15, 2025

What happens if we write to other btrfs sysfs entries, like bg_reclaim_threshold, before a mount is completed? Is it a general issue with sysfs handling, or just these patches?

Probably just those sysfs knobs which actually access per-device info - reading or writing may not make a difference here: if the info is not available yet, there will be dragons... er, null pointers...

I think there are only a few sysfs entries for btrfs which don't initialize early or which aren't available via the superblock. So this reported incident is more or less an oversight. I didn't develop those patches from scratch, they are based on other people's work. So I should fix what I maintain. And if I identify problems in the other sysfs parts, I might submit a patch upstream. But I think a problem there is most unlikely because udev (and thus btrfs-progs) fiddles around with some sysfs entries very early during discovery and mounting. An existing bug would most likely have already been found. ;-)

@CHN-beta commented:

Thank you for your great patch, it helped me a lot on my NAS.
I wonder if the patch will be available for kernel 6.18? I also want to use the patch on my tablet, but it requires kernel 6.18 to properly drive the graphics card.

@kakra (Owner) commented Dec 10, 2025

Thank you for your great patch, it helped me a lot on my NAS. I wonder if the patch will be available for kernel 6.18? I also want to use the patch on my tablet, but it requires kernel 6.18 to properly drive the graphic card.

Yes, it will be coming to the next LTS. But I usually do not work on it before the dot 1 or dot 2 release, although I'll try to base my patches off the dot 0 releases. Before publishing the patches, I do at least test them with my system fully booting and running the kernel, but I learned the hard way that I should not do that with dot 0 releases, and sometimes not even dot 1 releases.

To give you a perspective, I'm going to work on this during the holidays. I'll probably publish the patches still in this year but if I don't, it'll be ready early January.

I hope this is okay. xpadneo folks are also waiting on the 6.18 compatibility release already... :-D

@CHN-beta commented:

Thank you for your great patch, it helped me a lot on my NAS. I wonder if the patch will be available for kernel 6.18? I also want to use the patch on my tablet, but it requires kernel 6.18 to properly drive the graphic card.

Yes, it will be coming to the next LTS. But I usually do not work on it before the dot 1 or dot 2 release, although I'll try to base my patches off the dot 0 releases. Before publishing the patches, I do at least test them with my system fully booting and running the kernel, but I learned the hard way that I should not do that with dot 0 releases, and sometimes not even dot 1 releases.

To give you a perspective, I'm going to work on this during the holidays. I'll probably publish the patches still in this year but if I don't, it'll be ready early January.

I hope this is okay. xpadneo folks are also waiting on the 6.18 compatibility release already... :-D

Early January is totally OK for me, since my NAS will continue using the 6.12 kernel until about June.

Thank you again for your great work!

@kakra (Owner) commented Dec 11, 2025

Early January is totally OK for me, since my NAS will continue using the 6.12 kernel until about June.

Just out of my curiosity, and to focus my efforts better: Which features of the patch set are helping you the most?

Thank you again for your great work!

Great to hear, this is very appreciated. <3

@CHN-beta commented:

Just out of my curiosity, and to focus my efforts better: Which features of the patch set are helping you the most?

Allocator hints (allocating metadata on the SSD while keeping data on the HDDs) helped me a lot; they greatly enhanced my RAID performance. I also set the read policy to queue, but I did not intentionally benchmark its performance.

@MarkRose commented:

I'm particularly interested in the allocator hints. I'm adding small Optane devices to my systems for metadata.

@richardm1 commented:

Hi, sorry to bother you guys. Was [round-robin] "ungated" from behind CONFIG_BTRFS_EXPERIMENTAL=Y in 6.18?

@kakra (Owner) commented Dec 13, 2025

Hi, sorry to bother you guys. Was [round-robin] "ungated" from behind CONFIG_BTRFS_EXPERIMENTAL=Y in 6.18?

No. I'm currently working on the rebase to 6.18, and I've "ungated" it so you can use it without enabling all the experimental stuff. My rebase compiles, I am currently reviewing it and fixing a non-critical bug that was reported here. After this, I'll test drive the patches and then publish them. You'll get a notification from this PR when I publish them, so stay tuned.

Thanks. :-)

@kakra kakra added the done To be superseded by next LTS label Dec 13, 2025
@kakra (Owner) commented Dec 14, 2025

Superseded by #40

@kakra kakra closed this Dec 14, 2025
kakra added a commit that referenced this pull request Dec 16, 2025
v2: Adds a check to prevent modification while the file system is still mounting.

Todo:

- Transactions should not be triggered from sysfs writes, see:
  https://lore.kernel.org/linux-btrfs/20251213200920.1808679-1-kai@kaishome.de/

Link: #36 (comment)
Reported-by: Eli Venter <eli@genedx.com>
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>